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ABSTRACT 

Environmental biosurveillance and microbial 
ecology studies use PCR-based assays to detect 
and quantify microbial taxa and gene 
sequences within a complex background of micro- 
organisms. However, the fragmentary nature and 
growing quantity of DNA-sequence data make 
group-specific assay design challenging. We 
solved this problem by developing a software 
platform that enables PCR-assay design at an 
unprecedented scale. As a demonstration, we 
developed quantitative PCR assays for a globally 
widespread, ecologically important bacterial group 
in soil, Acidobacteria Group 1. A total of 33684 
Acidobacteria 16S rRNA gene sequences were 
used for assay design. Following 1 week of compu- 
tation on a 376-core cluster, 83 assays were 
obtained. We validated the specificity of the top 
three assays, collectively predicted to detect 42% 
of the Acidobacteria Group 1 sequences, by PCR 
amplification and sequencing of DNA from soil. 
Based on previous analyses of 16S rRNA gene 
sequencing, Acidobacteria Group 1 species were 
expected to decrease in response to elevated at- 
mospheric C0 2 . Quantitative PCR results, using 
the Acidobacteria Group 1 -specific PCR assays, 
confirmed the expected decrease and provided 
higher statistical confidence than the 16S rRNA 
gene-sequencing data. These results demonstrate 
a powerful capacity to address previously intract- 
able assay design challenges. 

INTRODUCTION 

Polymerase chain reaction (PCR)-based assays to detect and 
quantify microbes in environmental samples containing a 



complex background of microbial DNA are important 
tools in national defense, public health and microbial 
ecology. Consequently, many algorithms have been 
developed for PCR-assay design. Existing algorithms can 
be divided into four categories based on the number 
of target and related non-target sequences that are evaluated 
during design: (i) a single target sequence (1-4), (fi) multiple 
target sequences (5-15), (iii) a single target sequence and 
multiple non-targets (16-18) and (iv) multiple target se- 
quences and multiple non-targets (19-24). The last 
category is the most general and most challenging 
problem — the group-specific assay design problem. 

As more sequence data is deposited in public databases, 
the number of groups that can be monitored increases. 
Target groups include pathogens and their closest relatives 
(25-27), functional groups (28-30) or broad taxonomic 
groups (30,31). However, the growing amount of 
sequence data presents substantial challenges for assay 
design. Three central challenges are (i) the variation in 
the length and overlap of available sequences, (ii) the 
absence of target-specific signatures (i.e. oligonucleotide 
sequences that are present in all target sequences and 
absent in all near neighbors) and (iii) computational 
scale, determined by the number and length of target 
and non-target sequences. The lack of algorithms able to 
confront these growing challenges makes the design of 
many group-specific assays intractable. 

Variation in the length and overlap of available 
sequences impedes assay design. Short sequences reduce 
the regions available for assays, and the lack of overlap (or 
only partial overlap) (32) forces the omission of data that 
may contain informative biological variation. For 
example, if 1200 bp sequences were required as input for 
assay design, half of the 16S rRNA gene sequences in the 
Ribosomal Database Project (RDP) (33) would be 
omitted. About 52% of sequences in the RDP (release 
10, 5 April 2011) are less than 1200 bases (the full-length 
gene is 1400-1500 bp), and the sequences represent differ- 
ent regions of the 16S rRNA gene. Variation in sequence 
length also impedes efforts to determine the extent to 



*To whom correspondence should be addressed. Tel: +1 505 667 3770; Fax: +1 505 665 3024; Email: jgans(S lanl.gov 
Present address: 

Stephanie A. Eichorst, Department of Microbial Ecology, University of Vienna, Vienna, Austria. 
© The Author(s) 2012. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ 
by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. 



e96 Nucleic Acids Research, 2012, Vol. 40, No. 12 



Page 2 of 1 1 



which an assay or set of assays covers existing sequences. 
Current programs do not address the problem of variable 
length sequences. 

The lack of target-specific signatures also restricts assay 
design. A signature sequence is unique to, and conserved 
within, the target group and thus confers assay specificity. 
With the exception of PRISE (22), computer programs for 
group-specific assay design require signature primers 
(19-24). This approach is attractive because it is compu- 
tationally inexpensive, but it prevents the design of assays 
for groups that lack a unique signature. For such groups, 
specific assays can still be obtained by exploiting the spe- 
cificity arising from the combination of forward and 
reverse primers. That is, the primers can be individually 
non-specific, but group-specific as a pair (Figure 1). This 
strategy was used to manually design a PCR assay that 
differentiates Brucella abortus from the closely related 
B. suis, B. melitensis, and B. ovis (34). The assay targets 
genes that are present in all Brucella species but are 
uniquely arranged in B. abortus. Design programs that 
require a signature primer would not discover this 
primer pair. The PRISE program (22) is capable of de- 
signing this type of assay but cannot accommodate design 
problems on a large computational scale (determined by 
the number and length of input sequences). 

To address these limitations, we developed two 
algorithms: SeqStrap and ProSig. SeqStrap enables use 
of partially overlapping sequences that might otherwise 
be discarded, while ProSig performs assay design for 
different assay formats (e.g. PCR, TaqMan PCR and 
other probe-based PCR assays). Here, we demonstrate 
the sequential application of these algorithms for design 
of quantitative PCR (qPCR) assays specific for 
Acidobacteria Group 1. This group was chosen because 
it is ecologically important and technically challenging to 
target. The Acidobacteria are of ecological interest due to 
their high abundance in soils, ~20% of the total bacterial 
community (35), along with their response to pH (36-38), 
carbon (31,39), soil management (39-41) and elevated 
atmospheric carbon dioxide (42,43). The phylum 
contains 26 rRNA gene sequence similarity groups, 
labeled Groups 1 to 26 (44). Acidobacteria Group 1 is of 
particular interest because it contributes to cellulose 
degradation (39,45) and responds to ecosystem exposure 
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Figure 1. Non-signature primers can produce a target-specific assay. 
Primer sequences A and B occur in both the target and the background 
sequences, but only occur as a pair in target sequences. Instead of 
exploiting unique oligonucleotides (i.e. signatures), assays can exploit 
unique oligonucleotide combinations. 



to elevated atmospheric CO2 (42,43), a factor in global 
warming. 

The diversity of Acidobacteria Group 1 illustrates 
the scope of the assay design problem. There is little 
formal taxonomic structure within Group 1 . Formal struc- 
ture (i.e. the elucidation of taxonomically well-defined 
Orders, Families and Genera) requires analyses of 
representative bacterial cultures. These organisms are 
difficult to culture and there are only 5-sequenced 
genomes (S. Lucas, A. Copeland, A. Lapidus, J.-F. 
Cheng, L. Goodwin, S. Pitluck, H. Teshima, J. C. 
Detter, C. Han, R. Tapia et al., unpublished results; S. 
Lucas, A. Copeland, A. Lapidus, J.-F. Cheng, 
L. Goodwin, S. Pitluck, A. Zeytun, J.C. Detter, C. Han, 
R. Tapia, et al., unpublished results; S. Lucas, A. 
Copeland, A. Lapidus, J.-F. Cheng, L. Goodwin, S. 
Pitluck, A. Zeytun, J.C. Detter, C. Han, R. Tapia et al., 
unpublished results) (46) and 10 formally described 
cultured species (36,47-51) in Group 1. Consequently, di- 
versity within the Group is known mainly from rRNA 
gene sequences amplified from environmental samples. 
Among the 1339 nearly full-length (>1200bp) 16S 
rRNA gene sequences representing Acidobacteria Group 
1 in the RDP (33), the most divergent sequences are ~85% 
similar (39). Using a complete linkage clustering algorithm 
(33) and thresholds corresponding to 95, 90 and 85% 
sequence similarity, we found the number of clusters 
formed by the 1339 sequences at each threshold was 132, 
10 and 1, respectively. In taxonomic terms, these clusters 
provide a lower bound estimate of the number of genera, 
families and orders, illustrating the extensive breadth of 
diversity within the group. Many failed attempts have 
been made over the past decade to design PCR assays 
for Acidobacteria Group 1. Here, we demonstrate the 
success of the SeqStrap and ProSig programs in solving 
this complex design problem. 

MATERIALS AND METHODS 

Sequences for assay design 

Acidobacteria 16S rRNA gene sequences were obtained 
from the RDP release 10 (33) on 21 August 2010. The 
sequence lengths ranged from 262 to 1512 bp for the 
Group 1 Acidobacteria, and from 235 to 1697 bp for 
the non-Group 1 Acidobacteria. 

Hardware 

SeqStrap and ProSig were run on a 376-core, 
2.5 GHz Intel Xeon-based computer cluster with a 
gigabit Ethernet network. SeqStrap requires a CPU that 
supports Streaming SIMD Extensions (SSE) version 4.1 
(or higher) to compute the Smith-Waterman alignment of 
a single query sequence against four subject sequences in 
parallel using 32-bit alignment scores. 

SeqStrap and ProSig software 

The two algorithms are implemented in C++ and 
compiled with gcc to run on the Linux operating system. 
The MPI software library is used to parallelize the 
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calculation on a cluster of Linux computers. Each 
program is invoked independently via the command line 
or cluster batch scheduling software. Both ProSig and 
Seq Strap are publicly available as open source software 
(GNU public license version 2) and can be downloaded 
from http://public.lanl.gov/jgans. 

Sequence extrapolation algorithm 

SeqStrap is a computationally intensive algorithm that 
performs iterative, pair-wise sequence extrapolation as a 
preprocessing step before assay design. The algorithm, 
outlined in Figure 2, iteratively extends each partial 
target sequence by adding the overhanging sequence 
from the most similar target sequence (illustrated in 
Figure 3). The most similar sequences must have an align- 
ment score greater than a preset threshold and must be the 
highest scoring pair-wise alignment found by a search 
against all other target sequences. SeqStrap uses a conser- 
vative threshold score of 50 (see the scoring scheme 
described in the legend of Figure 2). 

To reduce the number of unproductive sequence com- 
parisons (i.e. alignments that do not contribute to 
sequence extrapolation), sequences are sorted by length 



do { 



sort sequences in ascending order by length 

for each query sequence in sequences { 

if (query sequence length < maximum sequence length) { 

compare query against all other sequences 

find best scoring Smith Waterman alignment with 
5 ' and/or 3 ' sequence that overhangs the query 

if (best alignment score > score threshold) { 

extrapolate query by adding overhanging 
sequence 

break out of for-loop 



} while still extrapolating 



Figure 2. Pseudocode description of the SeqStrap algorithm. Sequence 
alignments are computed using Smith-Waterman dynamic 
programming with the following scores: 2 for a match, —3 for a 
mismatch, —5 for gap existence and —2 for gap extension. No 
penalty is assigned for overhanging ends. When multiple sequence 
alignments produce the same highest score, the alignment with 
the largest amount of overhanging sequence is selected. When the 
amount of overhanging sequence is also the same, the order of the 
sequences in the input file is used to break the tie (to insure reprodu- 
cible results). Examples of overhanging sequence are shown in Figure 3. 
Two heuristic parameters are required: a score threshold and a 
maximum sequence length. The score threshold prevents spurious 
matches from creating sequence chimeras. Requiring a maximum 
sequence length was empirically found to prevent infinite extrapolation 
loops, e.g. terminal mismatches leading to an endless cycle of sequence 
A extrapolating sequence B, B extrapolating sequence C, and C 
extrapolating sequence A. All extrapolations used a score threshold 
of 50 and a maximum sequence length of 1500 bp. Because the 
maximum length criterion is applied before extrapolation, it is 
possible to produce extrapolated sequences that exceed the maximum 
sequence length. 



in ascending order at the beginning of each iteration, 
and the shortest sequences are extrapolated first. 
Overlap alignments between pairs of sequences are 
computed with Smith-Waterman dynamic programming 
(52). Because only the alignment score and the coordinates 
of overhanging sequence are needed (as opposed to a 
detailed pair-wise alignment), SeqStrap uses a linear-space 
variant of the Smith-Waterman dynamic programming 
that only stores two rows in the dynamic programming 
matrix (52). This avoids the potential storage limitation in 
aligning longer (>10 5 bp length) nucleic acid sequences 
(the dynamic programming matrix size is proportional 
to the product of the aligned sequence lengths). 

To reduce the computational burden of extrapolation 
with large numbers of sequences, SeqStrap exploits two 
levels of computer parallelism to compute the independent 
pair-wise alignments. At the lowest level, alignments 
between the query sequence (to be extrapolated) and the 
subject sequences (sources of extrapolated sequence) are 
computed in parallel (53). By using the 128-bit SSE avail- 
able on modern Intel and AMD CPUs, four independent 
alignments (using 32-bit scores) can be computed in 
parallel. At the highest level, all sequences are uniformly 
distributed between available CPU cores in a cluster 
computer using the MPI parallel toolkit. For each query 
sequence, every core computes the alignments between the 
query and the locally stored subject sequences. 

Even with this parallel implementation, 1 week on a 
376-core cluster represents a significant investment of 
computational resources. The expended computer time 
reflects our choice of problem (16S rRNA gene sequences, 
one of the most abundant sequence types in public data- 
bases) and the desire for highly accurate, overlap sequence 
alignments (using Smith-Waterman dynamic progra- 
mming). The required running time of the extrapolation 
algorithm is difficult to predict a priori and depends on 
several factors, including number of sequences, distribu- 
tion of sequence lengths, pair-wise similarities between 
sequences and cluster hardware. However, a significant 
speed up in algorithm performance could be obtained by 
replacing the Smith-Waterman alignment algorithm with 
a faster (likely heuristic), sequence overlap alignment 
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(b) add the 5'-overhanging sequence from the best match, and 

(c) add both the 5'- and 3'-overhanging sequence from the best match. 
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algorithm. The open source implementation of the 
SeqStrap algorithm can serve as a benchmark for future 
algorithmic improvements that trade alignment accuracy 
for alignment speed. 

PCR primer design algorithm 

ProSig designs PCR, TaqMan PCR and probe-based 
group-specific assays by large scale assay enumeration 
followed by subtraction of enumerated assays that poten- 
tially cross-react with non-target sequences (Figure 4). A 
greedy set-coverage solver then identifies the minimal sets 
of assays required to detect all targets. Multiple sequence 
alignments and signature oligonucleotides are not used in 
the design process. Large numbers of target sequences 
(ranging in size from genes to bacterial genomes) can be 
processed. In addition, because each forward and reverse 
primer will participate in multiple assays, we only perform 
computationally expensive, pair-wise sequence alignments 
once for each primer and template sequence. Finally, we 
exploit parallel computing to distribute both the required 
storage and sequence comparison calculations across a 
cluster of computers. 

Assay enumeration is computationally demanding, but 
feasible with a mid-sized cluster (i.e. hundreds of cores). 
When enumerating assays from a selected target sequence, 
all valid forward and reverse primers from the core 
(i.e. non-extrapolated) region of the selected targets are 
enumerated. Valid primers are defined by user-specified 
constraints in length, primer-template melting tempera- 
ture, primer-hairpin melting temperature and other 
common guidelines for primer design (Table 1). Melting 
temperatures are calculated with a standard nearest- 
neighbor thermodynamic model (54) that provides both 



Figure 4. Pseudocode description of the greedy PCR-assay enumer- 
ation algorithm. Core sequence refers to the original, non-extrapolated 
sequence. Primer and assay enumeration used the target parameters 
listed in Table 1, while the removal of assays used the non-target par- 
ameters listed in Table 1. 



melting temperature and free energy change given a 
pair-wise sequence alignment that can include mismatches 
and gaps. 

The potential number of forward and reverse primer 
candidates is of order 2(nL), where L is the size of the 
core target sequence (e.g. ~1000bp for 16S rRNA gene 
targets), and n is the number of allowed primer lengths 
(typically less than 10 different lengths ranging from 20 to 
30 bp). While each primer can potentially participate in 
multiple PCR assays, the computationally expensive 
primer-template and primer-hairpin melting temperatures 
are only calculated once for each primer. When computing 
the primer-template melting temperature, only the perfect 
match is considered. Combining the enumerated primers, 
the number of possible PCR assays is order (AAn 2 L), 
where A A is the range of allowed amplicon sizes 
(i.e. largest amplicon size — smallest amplicon size; typic- 
ally <200 bp). The calculation of the primer-dimer melting 
temperature for the forward and reverse primers must be 
computed for every assay. This calculation dominates the 
enumeration of valid assays from a single target sequence 
because the number of primer combinations is far greater 
than the number of primers. In practice, the number of 
melting temperature calculations is less than order 
(AAn 2 L), because only a fraction of possible primers 
satisfy the constraints shown in Table 1, but a general 
computational framework is still needed that can 
perform ~10 7 -10 10 independent melting temperature cal- 
culations per target sequence. The ProSig program 
provides this framework by distributing the melting tem- 
perature calculations in parallel on multiple CPUs in a 
computer cluster. 

The computational cost of performing PCR-assay sub- 
traction (i.e. matching and removing assays that poten- 
tially cross-react with non-target sequences) is of order 
(2nL), which is far less demanding than the assay enumer- 
ation step. The melting temperature is calculated for each 
primer bound to each non-target sequence. Since primer 
binding can potentially occur anywhere in a non-target 
sequence, melting temperature calculations are initiated 
for all non-target sequence subregions that share at least 



Table 1. PCR assay design constraints 



Allowed target amplicon lengths 


70-300 bp 


Allowed primer lengths 


18-28 bp 


Allowed target primer T m 


59-63° C 


Maximum primer hairpin T m 


40° C 


Maximum primer dimer T m 


40°C 


Minimum target primer 3' clamp 


5 bp 


Maximum non-target primer T m 


45°C 


Maximum non-target amplicon length 


1000 bp 



The constraints used to (a) enumerate PCR primers and assays from 
target sequences and (b) subtract PCR assays that matched non-target 
sequences. An assay matched a non-target sequence if both primers 
bound in the correct orientation, with melting temperatures 
(T' m )>45°C, and produced an amplicon < 1000 bp. All melting tem- 
peratures were computed by a thermodynamic alignment (55) 
between primer and template, using the nearest neighbor parameters 
of SantaLucia (54), and assuming a primer concentration of 9 x 10~ 7 M 
and a salt concentration of 0.05 M. Target match criteria are more 
stringent that non-target match criteria. 



while there are targets to amplify { 

select target with largest core sequence 

enumerate all valid forward primers 

enumerate all valid reverse primers 

enumerate all valid PCR assays 

for each non-target sequence { 

remove matching PCR assays 

} 

for each PCR assay { 

find matching target sequences 

store assay (s) that matches largest number of 
targets 

} 

write assay(s) that matches largest number of 
targets 

remove targets amplified by assay (s) that matches 
largest number of targets 

} 
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six consecutive bases of perfect complementarity with a 
given primer (identified by a hash table lookup). All 
non-target sequence binding sites with melting tempera- 
tures above the user-defined threshold are stored for 
each primer. Assays that could conceivably produce 
amplicons with non-target sequences are rejected. Assay 
rejection occurs if two primer binding sites occur in the 
correct orientation, within the allowed non-target 
distance, and with melting temperatures above the 
non-target T m threshold (Table 1). Also, at least one of 
the primers must have one or more 3' terminal bases that 
are perfectly matching to the non-target sequence to be 
suitable for polymerase extension. If both primers have 
3' terminal mismatches to the non-target sequence, then 
the assay is not rejected. All possible primer pair combin- 
ations are tested (i.e. forward/reverse, forward/forward 
and reverse/reverse). As with the enumeration step, the 
assay subtraction step is performed in parallel for each 
non-target sequence. Only those assays that survive the 
subtraction process for the i-th non-target sequence are 
searched against the i+\ non-target sequence. 

When searching candidate assay signatures against 
multiple non-target sequences, we sorted the non-target 
genomes by degree of similarity to the target genome 
(most similar first) in order to remove the largest 
number of candidate signatures as early in the calculation 
as possible. Because candidate signatures that match a 
non-target sequence are removed from further consider- 
ation, this strategy reduces the number of signature 
searches that must be performed for subsequent non- 
target sequences. For the purpose of prioritizing 
non-target genomes for signature subtraction, genome 
similarity is defined by the magnitude of the difference 
between the normalized dinucleotide composition vectors 
for each genome [similar to (56), but using vectors of raw 
dinucleotide counts normalized to length one]. 

After assay subtraction, all remaining assays are pre- 
dicted to be specific to one or more target sequences 
within the group. Identification of the smallest number 
of assays required to amplify all target sequences in the 
group is a standard set coverage problem (11). We used a 
standard greedy heuristic solution (57); after every iter- 
ation of enumeration and subtraction for a selected 
target sequence, the algorithm stores the assay(s) that 
amplify the largest number of previously unamplified 
target sequences. Target sequences amplified by previously 
stored assays are ignored. Not every target sequence is 
guaranteed to yield a specific assay (e.g. if the same 
sequence appears in both the target and non-target 
categories, then none of the enumerated assays will 
survive non-target subtraction). 



The algorithm uses a process loop that consists of 

(i) selecting a previously unamplified target sequence, 

(ii) enumerating valid assays, (iii) subtracting assays 
that match non-target sequences, and (iv) finding the 
assay (or assays) that amplifies the largest number of 
previously unamplified targets. This process loop con- 
tinues until all targets have been selected or amplified 
(or the process is manually terminated). 

This approach fails in easily recognized modes 
when challenged with mislabeled sequences. If a target 
sequence is accidentally included in the non-target 
category, it will be 'first in line' for assay subtraction by 
virtue of its high similarity to other target sequences, and 
will quickly eliminate the majority of assay candidates. If a 
non-target sequence is accidentally included in the target 
category, its assays will be subtracted by other non-target 
sequences. Any assays that do survive (due to unique 
features in the mislabeled sequence) will be unlikely to 
detect other correctly labeled target sequences, and the 
set coverage algorithm will require a separate assay just 
to detect the mislabeled sequence. If a sequence that is 
dissimilar to both target and non-target sequences is 
included in the target category, it will generate a large 
number of assays that survive the non-target subtraction 
step but only detect the dissimilar sequence. Sequences 
that are dissimilar to both target and non-target sequences 
and are included in the non-target category will not affect 
the output (since they will not eliminate any assays during 
the subtraction step). 

qPCR assays 

Assays were performed with Biorad iQ SyBr Green 
Supermix and three primer sets designed for specific 
detection of Acidobacteria Group 1 (Table 2). Primers 
were obtained from Invitrogen. Each 25 ul qPCR assay 
contained primers at 0.2 uM. Cycling conditions were as 
follows: 94°C for 5.0min; 40 cycles of 94°C for 15s, 65°C 
for 30 s; a melt curve of 91 cycles, 30 s each, ramping 0.5°C 
per cycle from 50.0 to 95.0°C; 4.0°C storage. Standard 
curves for primer sets AcidoGl_8.1 and AcidoGl_8.17 
were generated with purified, genomic DNA from 
Acidobacterium capsulation ATCC 51196 (hereafter 
A. capsulatum). A standard curve was not obtained for 
primer set AcidoGl_8.2 because it does not amplify 
A. capsulatum. Assays were applied to soil DNA 
samples from a field experiment in Rhinelander, 
Wisconsin, described in (42). 

16S rRNA gene clone libraries 

Amplicons from triplicate qPCR reactions for each of 
three replicate soil DNA samples were pooled (i.e. n = 9 



Table 2. PCR primer pairs for the specific detection of Acidobacteria Group 1 16S rRNA gene sequences 



Name 


5'-Forward-3' 


5'-Reverse-3' 


Amplicon 








length (bp) 


acidoGl 8.1 


GAACCTTACCTGGGCTCGAAA 


GTGCTCAACTAAATGGTAGCAACTG 


214 


acidoGl 8.2 


GGTGCGTGGAATTCCCGG 


GCGGATTGCTTATCGCGTTAG 


229 


acidoGl_8.17 


CCCTTGGGACGTAAACTCCTT 


TTCCACGCACCTCTCCCA 


306 
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reactions), cloned and sequenced to evaluate assay speci- 
ficity. Amplicons were purified (Qiagen QIAquick PCR 
Purification Kit) prior to cloning (Invitrogen TOPO TA 
Cloning Kit). For each qPCR primer set, 192 clones were 
picked and the cloned 16S rRNA gene fragments were 
bi-directionally sequenced with Ml 3 primers. 

Sequence processing 

Assembled sequences were visually inspected in 
Sequencher v4.7 (Ann Arbor, MI) to confirm the 
sequences were full length, as indicated by the presence 
of forward and reverse primer sites. A total of 149, 158 
and 163 useable sequences were obtained for primer sets 
AcidoGl_8.1, AcidoGl_8.2, AcidoGl_8.17, respectively. 
Sequences were aligned in SILVA (58). Aligned sequences 
were compiled in a single database in ARB (59). 

Phylogenetic specificity 

Phylogenetic placement of the amplicon sequences from 
qPCR assays was determined in two ways. First, the 
sequences were classified by an automated classifier in 
the RDP (60). Secondly, the sequences were added by 
quick Parsimony to a phylogenetic guide tree in ARB. 
The guide tree was generated with 145 nearly full-length 
(ranging from ~1000 to 1500 bp) reference sequences 
representing Acidobacteria Groups 1, 3, 4, 5, 6, 7, and 8. 
Archangium gephyra (P-Proteobacteria, AB2 18222), 
Solimonas soli (y-Proteobacteria, EF067861), Chitinibacter 
tainanensis (P-Proteobacteria, AY264287) and Leeia 
oryzae (P-Proteobacteria, DQ280369) were used as add- 
itional out groups. The maximum likelihood algorithm 
(RAxML) in ARB (59) was used to generate the guide 
tree with a 70% base frequency filter generated with 
Acidobacteria reference sequences representing Groups 1, 
3, 4, 5, 6, 7 and 8. Amplicon sequences from the Acidoba- 
cteria Group 1 assays were added to the guide tree using 
the ARB parsimony algorithm, since the amplicons were 
short and represented different regions of the 16S rRNA 
gene. 

qPCR fold-change analysis 

The fold-change in relative abundance of groups targeted 
by qPCR assays was calculated as follows: (amplification 
efficiency) [(mean CT amb ™'H>"™ ct elevated co2)] In this ca ,_ 

culation, the amplification efficiency term was computed 
from the slope of the PCR standard curve generated with 
A. capsulatum using e ( ~ i/siope)^ y^e amplification 
efficiency was 1.85 and 1.94 for assays AcidoGl_8.1 and 
AcidoGl_8.17. The amplification efficiency of the third 
assay, AcidoGl_8.2, was not determined because the 
assay does not amplify A. capsulatum (the only available 
genomic DNA for the test). Therefore, the amplification 
efficiency of the first assay, AcidoGl_8.1, was used 
to approximate the fold-change calculations for 
AcidoGl_8.2. Similar results were obtained when using 
the amplification efficiency of AcidoGl_8.17 as a substi- 
tute value. 



Sequence deposition 

The 470 amplicon sequences from the three assays are 
included in both the supplementary online material and 
the ProSig software package. 

RESULTS AND DISCUSSION 

Sequence extrapolation 

To maximize the number of sequence inputs for PCR- 
assay design, 8430 Acidobacteria Group 1 sequences and 
25254 other Acidobacteria 16S rRNA gene sequences 
ranging from 235 to 1697 bp in length were separately 
processed with SeqStrap. Sequence extrapolation took 
about 1 week on a 376-core cluster. Extrapolation 
increased the length of 96% of the Acidobacteria 
Group 1 sequences and 94% of the other Acidobacteria 
sequences. Extrapolation increased the average length of 
target sequences from 759 to 1469 bp and the average 
length of non-target sequences from 826 to 1512 bp. 
Sequence extrapolation occurs only between a sequence 
and the single most similar matching sequence, identified 
by the largest pair-wise sequence alignment score (above a 
set threshold) among the pool of sequences to be 
extrapolated. The default alignment-scoring scheme and 
the minimum allowed score for sequence extrapolation 
(Figure 2) allows extrapolation between pairs of sequences 
with (gap free) sequence identities >60% (in the long 
sequence alignment limit). This constraint reduces, but 
does not eliminate, the generation of chimeric artifacts 
by extrapolation. However, since our group-specific 
assay design process works by enumerating candidate 
assays only from the unextrapolated portion of target 
sequences, chimera-specific assays will not arise. It is still 
possible for chimeric sequences to cause a reduction in 
predicted assay coverage (i.e. the number of target 
sequences detected by a given assay). To our knowledge, 
the SeqStrap algorithm is a unique approach that over- 
comes the challenge of fragmentary data in sequence data- 
bases and, thus, maximizes the data that can be exploited 
for assay design. In our pipeline, the extrapolated 
sequences are used during the design process to assess 
the extent to which assays can amplify existing sequences. 

Design of PCR assays specific for Acidobacteria Group 1 

ProSig was used to enumerate target-specific assays. The 
Acidobacteria Group 1 sequences did not have a specific 
signature that could be used for a PCR assay. The lack of 
a signature and the large number of sequences made this 
target group a suitable test of ProSig's design capability. 
The cumulative fraction of Acidobacteria Group 1 
sequences covered by the assays was monitored in 
real-time by computationally searching the assays 
against the sequences extrapolated by SeqStrap, and for 
comparison, to the original, unmodified (i.e. unextra- 
polated) sequences. The first four assays were predicted 
to cover ~50% of the extrapolated Acidobacteria Group 
1 sequences, whereas about 15 assays were predicted to 
cover 50% of the unextrapolated sequences (Figure 4). 
This illustrates one of the challenges posed by use of 
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unextrapolated sequences that vary substantially in length 
and overlap. After the first 40 assays (providing ~95% 
predicted cumulative coverage), the predicted coverage 
of Acidobacteria Group 1 extrapolated sequences plat- 
eaued (Figure 5). We terminated the ProSig run after 
generating 83 assays because further improvements in pre- 
dicted coverage were marginal (Figure 5). The 83 assays 
were predicted to collectively cover ~98% of the 
Acidobacteria Group 1 sequences and required ~48h to 
compute on our 376-core cluster. The forward primers for 
the assays collectively targeted about 24 distinct locations 
in the 16S rRNA gene. The reverse primers targeted about 
19 locations. The inability to detect the entire target group 
with a single assay emphasizes the difficulty of the assay 
design problem. Similar results were obtained using the 
Actinomycetales as a target group (data not shown), 
demonstrating that Acidobacteria Group 1 is not unique 
in posing a design challenge, and emphasizing the need for 
a robust, flexible design platform like ProSig. The ability 
of ProSig to predict a minimal set of assays to cover 
Acidobacteria Group 1 illustrates its flexibility and value. 

Three Acidobacteria Group 1 assays were chosen for 
experimental testing. To find the optimum subset of 
three assays predicted to collectively amplify the largest 
fraction of extrapolated target sequences, a brute force 
search was performed with all 91 881 possible three-assay 
combinations from the pool of 83 target-specific assays 
[i.e. the number of three assay combinations = 83!/ 
(3! x 80!)]. The Perl script used to perform this search is 
included with the ProSig software. It is not possible to 
determine an optimal subset of assays simply by 
examining Figure 5, because each assay may detect 
overlapping subsets of the target taxa. In addition, 
because the number of possible assay combinations 

mo 




Assay number 



Figure 5. Cumulative coverage of 8430 Acidobacteria group 1 16S 
rRNA gene sequences by 83 target-specific PCR assays. Coverage is 
the percentage of the extrapolated or unextrapolated (unmodified) 
target sequences amplified (in silico). The bar height represents the 
extrapolated (white bars) or unextrapolated (black bars) target 
coverage of individual assays. The dashed and solid lines represent 
the cumulative extrapolated and unextrapolated target coverage, 
respectively. PCR assays are plotted along the x-axis in the order of 
discovery. The maximum cumulative coverage is 98% for the 
extrapolated targets and only 84% for the unextrapolated targets. 
The coverage of individual assays ranges from a high of 23% to a 
low of 5% for the extrapolated targets, and from a high of 11% to a 
low of 1% for the unextrapolated targets. 
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grows rapidly, the brute force solution to the set-coverage 
problem is only practical for small combinations (i.e. less 
than 5). The optimum combination of three assays 
was AcidoGl_8.1, AcidoGl_8.17 and AcidoGl_8.2 
(Table 2). Using E. coli numbering, the assays, respect- 
ively, target the following three regions of the 16S 
rRNA gene: basepair 906 to 1071 (between variable 
regions V5 and V6), bp 373 to 607 (encompassing V4) 
and basepair 596 to 813 (encompassing V3), respectively. 
The three assays were predicted to collectively cover 42% 
of the extrapolated set of 8430 Acidobacteria Group 1 
sequences. The predicted clade coverage of the three 
assays is shown in Supplementary Figure SI. None of 
the primers in the three assays were 'signature' primers, 
illustrating the unique ability of ProSig to design assays 
for target groups that lack group-specific signatures. 

Validation of assay specificity 

The specificity of the assays was confirmed by sequencing 
PCR amplicons derived from soil samples. Soil is an 
excellent test-bed for assay specificity because the 
hyperdiversity of species in soil microbial communities 
(61,62) increases the opportunity for PCR primer 
cross-reactivity, and thus, loss of specificity. The 
RDP classifier (60) identified nearly all of the 149 (for 
AcidoGl_8.1), 158 (for AcidoGl_8.2) and 163 
(for AcidoGl_8.17) sequenced amplicons from each 
assay as Acidobacteria Group 1. The exceptions were 16 
sequences from AcidoGl_8.1, which could not be classi- 
fied with a confidence score >70%. This result was not 
surprising because the region of the 16S rRNA gene 
targeted by AcidoGl_8.1 does not provide good resolving 
power for the RDP classifier (60). More reliable classifica- 
tion was obtained by aligning sequences using a 16S 
rRNA gene-specific alignment strategy followed by recon- 
struction of a phylogenetic tree (60). When placed in a 
phylogenetic tree, all of the amplicon sequences fell 
within Acidobacteria Group 1 (Figure 6), confirming the 
predicted specificity of the assays. 

Application to microbial ecology 

The three group-specific assays were used to evaluate the 
response of Acidobacteria Group 1 in a field experiment 
focused on terrestrial ecosystem responses to a decade of 
elevated atmospheric C0 2 (42). The field experiment 
included three replicate field plots under ambient or 
elevated atmospheric C0 2 conditions. A composite soil 
sample was obtained from each plot, yielding six 
samples total. Exploratory Sanger-based, 16S rRNA 
gene surveys (about 270 sequences each) of the six 
samples showed a 2-fold decrease in the relative abun- 
dance of sequences classified as Acidobacteria Group 1 
in plots under elevated C0 2 (Figure 7), but the difference 
was not statistically significant by a pair-wise f-test (42). 
Subsequently, Pyrotag-based, single subunit amplicon 
libraries (100-fold larger that the Sanger surveys) of the 
same samples showed a significant (P = 0.017), 3-fold 
decrease in Acidobacteria Group 1 under elevated C0 2 . 
To investigate further, the three qPCR assays, 
AcidoGl_8.1, AcidoGl_8.17 and AcidoGl_8.2 were 
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Figure 6. Maximum-likelihood tree of Acidobacteria 16S rRNA gene sequences. Groups identified as 1, 3, 4, 5, 6, 7 and 8 are indicated to the right 
of the group. Sequences obtained from soil by PCR amplification with Acidobacteria Group 1 primer sets are indicated by color. The total number of 
sequences in each cluster (including both sequenced amplicons and reference sequences) is labeled at the cluster node. The number of sequenced 
amplicons for each respective PCR assay is shown in parentheses adjacent to the colored wedge. Wedges are colored to show the approximate 
fraction of amplicon from each PCR assay in the cluster (blue for 8.1, gold for 8.2 and purple for 8.17). Clusters containing reference sequences from 
cultured isolates are indicated. The scale bar indicates 0.10 changes per nucleotide. 



applied to the same samples. Although the three assays do 
not detect all members of Acidobacteria Group 1, we 
expected the assays would cover a sufficient number of 
species to capture the Group 1 responses observed in the 
16S rRNA gene surveys. With AcidoGl_8.1, there was no 
difference in abundance between ambient and elevated 
C0 2 samples. In contrast, ~5-fold decreases in abundance 
under elevated C0 2 were measured with AcidoGl_8.17 
and AcidoGl_8.2, and the differences were significant 
(P = 0.0083, P = 0.0023, respectively). The results 



demonstrate the value of the assays in validating and 
further characterizing results obtained from broader 
survey techniques. The ability to identify responsive 
target groups from survey data and rapidly follow-up by 
designing and applying group-specific assays is an import- 
ant capability in microbial ecology. 

Other software 

We are aware of only one other program that uses an 
approach comparable to ProSig. Like ProSig, the PRISE 
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Figure 7. Decrease in the relative abundance of Acidobacteria Group 1 
in a field experiment measured by different techniques applied to six 
soil DNA samples from control (ambient atmospheric C0 2 ) and treated 
(elevated atmospheric C0 2 ) plots. The Sanger estimate was obtained 
from surveys of about 300 16S rRNA gene sequences from each soil 
DNA sample. The Pyrotag estimate was obtained from surveys of 
about 50000 16S rRNA gene sequences from each sample. The vari- 
ation in the estimates is presumed to arise from the increasing quality 
of the assays (from left to right) and the specific members of 
Acidobacteria Group 1 sampled by each technique. 



program permits discovery of specific PCR assays 
composed of non-signature primers, but the program 
has several limitations that severely constrain its capabil- 
ity. The program performs a limited enumeration of 
valid, target-sensitive PCR assays. Screening the 
enumerated assays against all non-target sequences 
then identifies target-specific assays. However, this 
program requires a greedy enumeration, ignoring 
primers that do not match a significant fraction of the 
target sequences. When this constraint is omitted, the 
program exhausts the available memory, even for small 
problem sizes (i.e. a small number of target and query 
sequences). Thus, this program is suitable only for 
problems in which most of the targets can be detected 
by an assay. 



CONCLUSION 

We developed a computational methodology that exploits 
available nucleic acid sequence information at an unpre- 
cedented scale and level of complexity to develop robust 
PCR-based detection assays. This methodology enables 
the design of group-specific assays for a wide range of 
applications in microbial ecology, public health and agri- 
cultural safety. The last 20 years have witnessed an expo- 
nential growth in the number of available nucleic acid 
sequences. Improved computational tools that can scale 
with the growth of sequence information are still 
needed. The SeqStrap and ProSig help address this need. 
The software has been released as open source to facilitate 
future improvements in parallel scalability and novel assay 
design algorithms. 
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